Add CPU shaders #374
Conversation
This commit adds a full pipeline of CPU shaders for the compute stage of the pipeline (not counting fine rasterization). It's based on what's in the multi branch, but also ports bounding box fixes and other cleanups. Currently the selection of shaders is all or nothing (based on the use-cpu flag), but will probably become more fine-grained. Going forward, the intent is to keep CPU and GPU shaders in sync. It has been quite valuable in diagnosing problems in the pipeline.
Can the CPU shaders be UNLICENSE too, like their GPU counterparts? I'm targeting another port of Vello into the Gio graphics pipeline and one of the features we absolutely need is a CPU fallback.
May I ask what the plans are for the fine rasterization stage? Will Vello never run on platforms without GPU support for compute?
A few comments. I've not read through the shader code.
pub enum TypedBufGuardMut<'a, T: ?Sized> {
    Slice(&'a mut T),
Is this variant ever constructed?
I don't think so, I'd have to check.
/// Second, the actual mapping to CPU shaders is not really specific to
/// the engine, and should be split out into a back-end agnostic struct.
pub fn install_cpu_shaders(&mut self, engine: &mut WgpuEngine) {
    engine.set_cpu_shader(self.pathtag_reduce, cpu_shader::pathtag_reduce);
Ooh neat - this is much less noisy
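For what the "back-end agnostic struct" mentioned in the doc comment might look like, here is a minimal sketch; CpuShaderRegistry, CpuShaderFn, ShaderId, and CpuBinding are hypothetical stand-ins, not names from this PR. The idea is just that the id-to-function mapping lives outside the wgpu engine, which then becomes one consumer of it.

```rust
use std::collections::HashMap;

// Placeholder types; in the real crate these would come from the engine.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ShaderId(u32);
struct CpuBinding;

/// Hypothetical signature for a CPU shader entry point: it receives the
/// dispatch size and the bound resources, mirroring a compute dispatch.
type CpuShaderFn = fn(n_wg: u32, resources: &[CpuBinding]);

/// Hypothetical back-end agnostic registry from shader id to CPU fallback.
#[derive(Default)]
struct CpuShaderRegistry {
    shaders: HashMap<ShaderId, CpuShaderFn>,
}

impl CpuShaderRegistry {
    fn set_cpu_shader(&mut self, id: ShaderId, f: CpuShaderFn) {
        self.shaders.insert(id, f);
    }

    fn get(&self, id: ShaderId) -> Option<CpuShaderFn> {
        self.shaders.get(&id).copied()
    }
}
```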
impl Transform {
    pub fn apply(&self, p: Vec2) -> Vec2 {
        let z = self.0;
        let x = z[0] * p.x + z[2] * p.y + z[4];
I have not checked these operations, but am reminded of #283
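For context, a self-contained sketch of how the full apply presumably continues, assuming the usual [a, b, c, d, tx, ty] coefficient layout implied by the x row in the snippet; the y row here is my reconstruction by analogy, not quoted from the diff.

```rust
#[derive(Clone, Copy)]
struct Vec2 {
    x: f32,
    y: f32,
}

#[derive(Clone, Copy)]
struct Transform([f32; 6]);

impl Transform {
    fn apply(&self, p: Vec2) -> Vec2 {
        let z = self.0;
        let x = z[0] * p.x + z[2] * p.y + z[4];
        let y = z[1] * p.x + z[3] * p.y + z[5]; // reconstructed by analogy with the x row
        Vec2 { x, y }
    }
}
```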
Adds Unlicense to the SPDX line, and also relaxes the 64k (or any) limit on the number of pathtags in the CPU case. The CPU shaders don't exactly match the GPU in the "use_large_path_scan" case, but they do work.
Ok, I've fixed the crashes with large pathtags (though it's a bit ugly: the CPU doesn't quite match the GPU, and making it match would take a lot of repetitive code that wouldn't be required if Metal didn't suck so hard). I've also improved one of the comments. There's a draft of fine rasterization in the PR, I just haven't finished it. I should warn that while I think these can be adapted to reasonably performant fallback implementations, they're not currently optimized for performance; that might be downstream work, which I think makes sense after the pipeline has stabilized more. I think we may well do CPU fine rasterization, but there are a lot of details that are unclear, including whether we want to do it multithreaded and/or use SIMD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for making progress with the big CPU port. I reviewed all the shaders as carefully as I could and left a number of comments/suggestions/questions.
Aside from some questions I have around the subtle differences from the WGSL, I'm OK with merging this.
let draw_monoid = draw_monoids[element_ix];
let mut clip_bbox = [-1e9, -1e9, 1e9, 1e9];
if draw_monoid.clip_ix > 0 {
    clip_bbox = clip_bbox_buf[draw_monoid.clip_ix as usize - 1];
Might as well address the TODO in binning.wgsl and add this assertion here: assert!(draw_monoid.clip_ix - 1 < config.layout.n_clip).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also lower the length of the slice to that value?
That latter question is a deep one. It is clearly more idiomatic Rust, but I think the goal here is to match the GPU as much as possible, which to me reads as putting in the assert.
src/cpu_shader/coarse.rs
let mut tile_state = TileState::new(this_tile_ix);
let blend_offset = tile_state.cmd_offset;
tile_state.cmd_offset += 1;
// Discussion question: do these belong in tile state?
Is this comment referring to moving clip_depth and clip_zero_depth into the TileState struct? If so, they're mainly used for control flow here and I'm not sure if moving them there and manipulating them through methods would obscure the logic too much. I can imagine moving most of the DrawTag handling to TileState methods, but I'm not sure about the utility of it.
I've removed the comment. This was part of an internal dialog about whether to process one tile at a time (as the code currently does), or loop through chunks of 256 draw objects (which would be closer to the GPU model). If you were to go that direction, you'd need a struct to save tile state so you could suspend and then resume with the next chunk.
I'm reasonably satisfied with the current state of the code, as it's closer to the WGSL source, even if it doesn't match the execution model. In the WGSL source this looks like an ordinary scalar variable, but it is of course a set of VGPRs.
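As a rough sketch of the chunked alternative described above: everything here is hypothetical except the cmd_offset, clip_depth, and clip_zero_depth names from the discussion, and it only illustrates the suspend/resume control flow, not the real coarse logic.

```rust
/// Hypothetical per-tile state that would need to survive between chunks.
struct TileState {
    cmd_offset: u32,
    clip_depth: u32,
    clip_zero_depth: u32,
}

const N_TILE: usize = 256;

/// Sketch of the alternative control flow: loop over chunks of 256 draw
/// objects (closer to the GPU model), suspending and resuming each tile's
/// state between chunks instead of finishing one tile at a time.
fn coarse_chunked(draw_objects: &[u32], tiles: &mut [TileState]) {
    for chunk in draw_objects.chunks(N_TILE) {
        for tile in tiles.iter_mut() {
            for &_draw_obj in chunk {
                // ... per-draw-object work mutating `tile` would go here ...
                tile.cmd_offset += 1; // placeholder for emitting a command
            }
        }
        // `tiles` keeps its state and is resumed on the next chunk.
    }
}
```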
src/cpu_shader/coarse.rs
if bin_tile_x + tile_x < width_in_tiles && bin_tile_y + tile_y < height_in_tiles {
    ptcl[tile_state.cmd_offset as usize] = CMD_END;
    let scratch_size = 0; // TODO: actually compute
I'm guessing this is referring to the blend depth computation? Maybe mention "blend depth" in the comment so that it resonates with the TODO in the BEGIN_CLIP case above.
src/cpu_shader/draw_leaf.rs
        let path_ix = !ix as i32;
        clip_inp[m.clip_ix as usize] = Clip { ix, path_ix };
    }
    m = m.combine(&DrawMonoid::new(tag_word));
I wonder if this line should move to line 38 above, where the exclusive prefix sum gets stored into the draw_monoid buffer. In the GPU version the prefix sums get computed up front -- while the control flow is very different there, it's nice that the logic is grouped together.
You could store m.path_ix and m.clip_ix into temporaries before combining the monoids. If you prefer to keep things the way they are, then perhaps add a comment near line 34 that mentions that m gets prepared for the "next invocation" at the bottom of the loop.
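A minimal sketch of the temporaries idea, with DrawMonoid reduced to a two-field stand-in (the real monoid has more fields and a different constructor); the point is just that the exclusive value is captured and the combine happens in one place.

```rust
#[derive(Clone, Copy, Default)]
struct DrawMonoid {
    path_ix: u32,
    clip_ix: u32,
}

impl DrawMonoid {
    fn new(tag_word: u32) -> Self {
        // Stand-in: the real constructor decodes counts from the tag word.
        DrawMonoid { path_ix: (tag_word != 0) as u32, clip_ix: 0 }
    }

    fn combine(&self, other: &DrawMonoid) -> DrawMonoid {
        DrawMonoid {
            path_ix: self.path_ix + other.path_ix,
            clip_ix: self.clip_ix + other.clip_ix,
        }
    }
}

// Sketch of the suggested structure: store the exclusive prefix values in
// temporaries, then combine immediately, keeping the monoid logic together.
fn process(tag_words: &[u32]) {
    let mut m = DrawMonoid::default();
    for &tag_word in tag_words {
        let exclusive = m; // value seen by this element
        m = m.combine(&DrawMonoid::new(tag_word)); // prepared for the next element
        let _ = (exclusive.path_ix, exclusive.clip_ix); // would be written to the output buffers here
    }
}
```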
I've changed it in a slightly different way which I hope is clear; let me know if you like it.
src/cpu_shader/draw_leaf.rs
let tag_raw = if ix < config.layout.n_draw_objects {
    scene[(drawtag_base + ix) as usize]
} else {
    0
};
Is it worth defining a CPU version of read_draw_tag_from_scene in util.rs that you can reuse from here and draw_reduce?
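For illustration, a hedged sketch of what such a shared CPU helper could look like; the Layout stand-in and the exact signature are assumptions, only the field names and the 0 fallback come from the snippet above.

```rust
// Hypothetical stand-in for the config layout fields used in the snippet.
struct Layout {
    n_draw_objects: u32,
    drawtag_base: u32,
}

/// Reads the raw draw tag for `ix`, returning 0 (a no-op tag) past the end,
/// mirroring the bounds check in the snippet above.
fn read_draw_tag_from_scene(layout: &Layout, scene: &[u32], ix: u32) -> u32 {
    if ix < layout.n_draw_objects {
        scene[(layout.drawtag_base + ix) as usize]
    } else {
        0
    }
}
```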
            let a = (b + 0.5 * (d * d - c * c) - xmin) / (xmax - xmin);
            area[yi * TILE_WIDTH + i] += y_edge + a * dy;
        }
    } else if y_edge != 0.0 {
The logic (considering the nuance in line 85) all seems to be equivalent to the GPU version but I had to squint at the subtle branch shortcut here a bit before it made sense :) It's all good, it makes sense to avoid a NOP loop when running on the CPU.
src/cpu_shader/path_count.rs
let c = s0.x * sign - xt0;
let y0 = s0.y.floor();
let ytop = if s0.y == s1.y { s0.y.ceil() } else { y0 + 1.0 };
let b = dy_dxdy * c + a * (ytop - s0.y);
Is there a reason for the different formulations between the WGSL and Rust versions (dy_dxdy/idxdy, a, b, etc.)? Should they be the same?
I'm also slightly confused about a. It's defined as 1.0 - dy / (dx + dy) here but as dx / (dx + dy) in WGSL. It's then used in the same calculations below. Are these equivalent?
Ok, this question is incredibly deep water. The short answer is that you're looking at an experiment to improve numerical robustness. The ultimate solution to the artifact was to discard horizontal line segments aligned to grid boundaries before they get to this point. What happened there is a = 1 - epsilon, b = 0, while you need a = 1 for conservative line rasterization to generate the right answer. In CPU land, you can use a division instead of multiplying by a reciprocal to make that case work, and use the fact that dx/(dx + dy) + dy/(dx + dy) = 1 to avoid having to compute two divisions. In GPU land, WGSL doesn't give you that guarantee.
What I think is the best path going forward is to undo that experiment and make CPU match GPU. It's possible (especially with carefully constructed adversarial examples) that you might trigger incorrect behavior - in fact, this is one of the things I'll look at for #373. But you're absolutely right that CPU and GPU should match unless there's a compelling reason otherwise.
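To make the two formulations concrete, a small sketch using the names from the snippets; the GPU side is paraphrased into Rust, and idxdy is the reciprocal mentioned in the question above.

```rust
fn formulations(dx: f32, dy: f32) -> (f32, f32) {
    // GPU-style: compute the reciprocal once and multiply.
    // When dy == 0 this can land at 1.0 - epsilon instead of exactly 1.0.
    let idxdy = 1.0 / (dx + dy);
    let a_gpu = dx * idxdy;

    // CPU experiment: one true division, relying on
    // dx/(dx + dy) + dy/(dx + dy) == 1 so a second division isn't needed.
    // When dy == 0, dy_dxdy is exactly 0.0 and a_cpu is exactly 1.0.
    let dy_dxdy = dy / (dx + dy);
    let a_cpu = 1.0 - dy_dxdy;

    (a_gpu, a_cpu)
}
```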
src/cpu_shader/path_tiling.rs
let dy_dxdy = dy / (dx + dy);
let a = 1.0 - dy_dxdy;
I have the same question as before (why the different formulations between WGSL and Rust?)
Thank you for the round trip and addressing comments. I think we can move forward with this PR unless others have any concerns.